Jupyter notebook

Before starting, let's take a look at the Jupyter notebook.

  1. Stopping and halting a kernel
  2. Looking at which notebooks are running
  3. Cells
  4. Adding cells above and below
  5. Changing type of cell from Markdown to Code
  6. Adding math

Classes and objects

To import a module, use the keyword import followed by the name of the module.

In [ ]:
import sklearn

You can import this because the module sklearn is already part of the Anaconda distribution. You can explore the modules that are part of sklearn by typing from sklearn import and then pressing Tab.

In [ ]:
# try it below
from sklearn import 

# this also works with submodules
from sklearn.linear_model import

In [ ]:
# from the submodule linear_model, let's import LinearRegression
from sklearn.linear_model import LinearRegression

Python is based on object-oriented programming (OOP).

  • Objects are containers of data and functionality
  • Every object belongs to a class, and that class might inherit functionality from other classes
  • A class defines when and how its objects store data and how those objects behave

The imported LinearRegression is a class definition. You can find the parents of a class by reading its __bases__ attribute.
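
As a minimal sketch of these ideas, the hypothetical classes below (Model and MeanModel are made up for illustration and are not part of sklearn) show one class inheriting from another and how __bases__ exposes the parent.

```python
# Hypothetical classes for illustration only; they are not part of sklearn.
class Model:
    """Parent class: provides shared functionality."""

    def describe(self):
        return "I am a " + type(self).__name__


class MeanModel(Model):
    """Child class: stores data and inherits describe() from Model."""

    def __init__(self, values):
        self.values = list(values)  # data stored inside the object

    def mean(self):
        return sum(self.values) / len(self.values)


m = MeanModel([1, 2, 3])
print(m.describe())          # method inherited from Model
print(m.mean())              # behavior defined by MeanModel
print(MeanModel.__bases__)   # tuple containing the parent class Model
print(Model.__bases__)       # a class with no explicit parent inherits from object
```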

In [ ]:

To create an object, you call the class with parameters. To see the possible parameters of a class (or function) in the notebook, place the cursor inside the parentheses and press Shift-Tab once for a preview, twice for an expanded window, three times for an expanded window with no timeout, and four times for a split view of the help.
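
Outside the notebook, the standard-library inspect module gives similar information programmatically. The sketch below uses a hypothetical Greeter class (an assumption made for illustration) rather than a real sklearn class.

```python
import inspect


class Greeter:
    """Hypothetical class used only to demonstrate parameter inspection."""

    def __init__(self, greeting="hello", punctuation="!"):
        self.greeting = greeting
        self.punctuation = punctuation


# The signature of a class lists the parameters accepted when creating an object
sig = inspect.signature(Greeter)
print(sig)
for name, param in sig.parameters.items():
    print(name, "default:", param.default)

# Create an object by calling the class with parameters
g = Greeter(greeting="hi")
print(g.greeting, g.punctuation)
```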

In [ ]:
# try it below

Now, let's create a linear regression object

In [ ]:
lr = LinearRegression()

Again, we can explore that object by typing the name of the object, then ., and then Tab

In [ ]:
# try it here

If we type lr into the notebook, we get a customized description of the object

In [ ]:

We can obtain the class of the object programmatically by calling the built-in type function

In [ ]:

Every object also has a unique identity, which the built-in id function returns
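
A short sketch of what identity means, using plain Python lists rather than sklearn objects: id returns a number that is unique to an object while it exists, and the is operator compares identities.

```python
a = [1, 2, 3]
b = a            # a second name for the same object
c = [1, 2, 3]    # a different object with equal contents

print(type(a))           # the class of the object
print(id(a) == id(b))    # same identity, because a and b name one object
print(a is b)            # `is` compares identities directly
print(a == c, a is c)    # equal contents, but two distinct objects
```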

In [ ]:


sklearn ships with many datasets. We will use its diabetes dataset.

In [ ]:
from sklearn.datasets import load_diabetes

In [ ]:
diabetes_ds = load_diabetes()

In [ ]:
X = diabetes_ds['data']
y = diabetes_ds['target']

sklearn works mostly with NumPy arrays, which are $n$-dimensional arrays.

In [ ]:
[type(X), type(y)]

Numpy arrays

You can check the number of dimensions of an array

In [ ]:

Check the size of the dimensions

In [ ]:

Get slices of the dimensions. The following are all the same thing: grab the first two rows of a matrix
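
One possible set of equivalent expressions is sketched below on a small made-up array A (an assumption for illustration, not the diabetes data), together with the dimension checks mentioned above.

```python
import numpy as np

# Small made-up matrix standing in for X (4 rows, 3 columns)
A = np.arange(12).reshape(4, 3)

print(A.ndim)    # number of dimensions
print(A.shape)   # size of each dimension

# Three equivalent ways of taking the first two rows
first = A[0:2]
second = A[:2]
third = A[0:2, :]
print(np.array_equal(first, second) and np.array_equal(second, third))
```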

In [ ]:

In [ ]:

In [ ]:
X[0:2, :]

We can also grab columns in the same way

In [ ]:
X[:, 0:2]

Sometimes you want to grab just one column (feature), but NumPy returns a one-dimensional array

In [ ]:
X[:, 2].shape

We can reshape the $n$-dimensional array and add one dimension:

In [ ]:
X[:, 2].reshape([-1, 1])

In [ ]:
X[:, 2].reshape([-1, 1]).shape
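
The sketch below compares ways of selecting a single column on a small made-up array (an assumption for illustration). Indexing with a list, as in X[:, [2]], keeps the result two-dimensional without an explicit reshape.

```python
import numpy as np

A = np.arange(12).reshape(4, 3)      # made-up data

col_1d = A[:, 2]                     # an integer index drops the dimension
col_2d_a = A[:, 2].reshape(-1, 1)    # -1 lets NumPy infer the number of rows
col_2d_b = A[:, [2]]                 # indexing with a list keeps the dimension
col_2d_c = A[:, 2:3]                 # so does a slice of length one

print(col_1d.shape, col_2d_a.shape, col_2d_b.shape, col_2d_c.shape)
```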

You can do matrix algebra:

In [ ]:
# transpose

In [ ]:

For more functions, you can import numpy and its linear algebra submodule

In [ ]:
import numpy as np
import numpy.linalg as la
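
A minimal sketch of a few matrix operations, using a small made-up matrix M (an assumption for illustration) instead of the diabetes data:

```python
import numpy as np
import numpy.linalg as la

M = np.array([[2.0, 1.0],
              [1.0, 3.0]])   # made-up invertible matrix

Mt = M.T                     # transpose
P = M @ Mt                   # matrix product (same as M.dot(Mt))
M_inv = la.inv(M)            # inverse
print(np.allclose(M @ M_inv, np.eye(2)))  # product with the inverse gives the identity
print(la.det(M))             # determinant
```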

In [ ]:

Fitting models

OK, let's go back to our example with linear regression.

Usually, sklearn objects start by fitting the data and then either predict or transform new data. Predicting is usually for supervised learning and transforming is for unsupervised learning.

In [ ]:
# explore the parameters of fit

In [ ]:
lr2 = lr.fit(X[:, [2]], y)

fit returns an object. If we examine the id of the object it returns:

In [ ]:

In [ ]:

We see that it is the same object as lr. The call fits the data, modifies the internal state of the object, and returns the object itself.

Therefore, you can chain calls, which is a very powerful feature.
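
The pattern can be sketched with a hypothetical estimator (MeanPredictor is made up for illustration and is not an sklearn class): fit changes the object's state and returns self, which is what makes chaining possible.

```python
class MeanPredictor:
    """Hypothetical estimator: predicts the mean of the training targets."""

    def fit(self, X, y):
        self.mean_ = sum(y) / len(y)  # modifies the object's internal state
        return self                   # returning self is what allows chaining

    def predict(self, X):
        return [self.mean_ for _ in X]


model = MeanPredictor()
fitted = model.fit([[0], [1], [2]], [1.0, 2.0, 3.0])
print(fitted is model)   # fit returned the very same object

# fit and predict chained in a single expression
preds = MeanPredictor().fit([[0], [1]], [2.0, 4.0]).predict([[5], [6]])
print(preds)
```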

Explore the fitted object

By looking at the online documentation of LinearRegression, we can find out which attributes hold the parameters it found.

In [ ]:

In [ ]:


In [ ]:
# explore the parameters

In [ ]:
y_pred = lr.predict(X[:, [2]])

Because we know how linear regression works, we can produce the predictions ourselves

In [ ]:
y_pred2 = lr.intercept_ + X[:, [2]].dot(lr.coef_)
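
To make the formula concrete without sklearn, the sketch below fits ordinary least squares with NumPy on made-up data (the data and variable names are assumptions for illustration) and then predicts with the same intercept-plus-dot-product form.

```python
import numpy as np

# Made-up data that follows y = 1 + 2x exactly
x = np.array([[0.0], [1.0], [2.0], [3.0]])
y = 1.0 + 2.0 * x[:, 0]

# Add a column of ones so the intercept is estimated together with the slope
design = np.hstack([np.ones((x.shape[0], 1)), x])
params, *_ = np.linalg.lstsq(design, y, rcond=None)
intercept, coef = params[0], params[1:]

# Same form as the prediction above: intercept + X.dot(coef)
y_hat = intercept + x.dot(coef)
print(np.allclose(y_hat, y))
```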

In [ ]:
# this checks that all entries agree up to floating-point tolerance
np.allclose(y_pred2, y_pred)

Now, due to the powerful concept of chaining, we can combine fit and predict in one line

In [ ]:
y_pred3 = lr.fit(X[:, [2]], y).predict(X[:, [2]])

In [ ]:
np.allclose(y_pred3, y_pred)

Additional packages

Sometimes you want to use a package that you found online. Many of these packages are available through pip, Python's package installer.

For example, the package quandl allows quants to load financial data in Python.

We can install it in the console simply by typing

pip install quandl

And now we should be able to import that package

In [ ]:
import quandl

In [ ]:
import quandl
mydata = quandl.get("YAHOO/AAPL")

In [ ]:


In [ ]:
# this helps put the plot results in the browser
%matplotlib inline

Pandas is a package for loading, manipulating, and displaying data sets. It tries to mimic the functionality of data.frame in R

In [ ]:
import pandas as pd

Many packages return data in pandas DataFrame objects

In [ ]:
apple_stocks = quandl.get("YAHOO/AAPL")

In [ ]:

We can display the beginning of a data frame:

In [ ]:

In [ ]:

And also, we can plot it with pandas

In [ ]:

We can manipulate it too. Let's say we want to compute the stock returns

$$ r_t = \frac{V_t - V_{t-1}}{V_{t-1}} = \frac{V_t}{V_{t-1}} - 1 $$

But for this, we need to compute a rolling filter
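
One way to compute simple returns, $(V_t - V_{t-1}) / V_{t-1}$, is sketched below on made-up prices (an assumption for illustration, not real stock data). shift lines up each value with the previous one, and pct_change computes the same quantity directly.

```python
import pandas as pd

# Made-up prices; not real stock data
prices = pd.Series([100.0, 110.0, 99.0, 108.9])

# shift(1) aligns each value with the previous one
previous = prices.shift(1)
returns_manual = (prices - previous) / previous

# pct_change computes the same quantity directly
returns_builtin = prices.pct_change()

# The first entry is NaN because the first price has no predecessor
print(returns_manual)
```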

In [ ]:

In [ ]:

In [ ]:


Spark is a distributed, in-memory big data analytics framework. It is Hadoop on steroids.

Because we launched this Jupyter notebook with pyspark, a Spark context variable called sc is automatically available. It gives us access to the master and, through it, to the workers.

If we open the Spark dashboard (usually on port 4040), we can see some of the variables.

With the Spark context you can read data from many sources, including HDFS (Hadoop Distributed File System), Hive, Amazon's S3, files, and databases.

In [ ]:
# explore the variables and functions available in the Spark context

Spark usually works with RDDs (Resilient Distributed Datasets). More recently it has been moving towards DataFrames, which are similar to pandas DataFrames but distributed.

In [ ]:
rdd_example = sc.parallelize([1, 2, 3, 4, 5, 6, 7])

We can check the id of the RDD in the cluster

In [ ]:

In [ ]:
# this is an RDD

Let's explore the functions we have available

In [ ]:

One such function is take, which lets you get a taste of what the RDD contains

In [ ]:

Let's say you want to apply an operation to each element of the list

In [ ]:
def square(x):
    return x**2

Now we can apply that transformation to the RDD with the map function

In [ ]:
rdd_result = rdd_example.map(square)

You might notice that this returns immediately. This is because transformations on RDDs, such as map, are lazily evaluated
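
As a loose analogy in plain Python (this is not the Spark API and involves no cluster), the built-in map is also lazy: the function only runs when elements are actually requested. The calls list is a hypothetical helper that records when square runs.

```python
from itertools import islice

calls = []  # records each time square actually runs


def square(x):
    calls.append(x)
    return x ** 2


# Python's built-in map is lazy too: this line does not call square yet
lazy_result = map(square, [1, 2, 3, 4, 5, 6, 7])
print(calls)            # still empty at this point

# Consuming elements forces the computation, loosely like a Spark action
first_three = list(islice(lazy_result, 3))
print(first_three)
print(calls)            # only the requested elements were computed
```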

In [ ]:

So rdd_result is another RDD

In [ ]:

In fact, there is no duplication of data. Spark builds a computational graph that keeps track of dependencies and recomputes results if something crashes.

We can take a look at the contents of the results by using take again. Since take is an action, it will trigger a job in the Spark cluster

In [ ]:

In [ ]:

In [ ]:

Usually, once you have your results, you write them back to Hadoop for later processing, because they usually won't fit in memory.

In [ ]:
# this function can save into HDFS using Pickle (Python's internal) format

Spark's DataFrame

Unlike an RDD, a DataFrame has some structure. Again, you can create one from different sources. In this case, DataFrame functionality is available from another context called the sqlContext. This gives us access to SQL-like transformations.

In this example, we will use the sklearn diabetes dataset again

In [ ]:
from sklearn.datasets import load_diabetes
import pandas as pd

In [ ]:
diabetes_ds = load_diabetes()

To create a dataset useful for machine learning we need to use certain datatypes

In [ ]:
from pyspark.mllib.regression import LabeledPoint

In [ ]:

In [ ]:
from pyspark.ml.linalg import Vectors

In [ ]:

In [ ]:
Xy_df = sqlContext.createDataFrame(
    [[float(l), Vectors.dense(d)]
     for d, l in zip(diabetes_ds['data'], diabetes_ds['target'])],
    ["y", "features"])

In [ ]:

We can register the DataFrame in Spark as a SQL table (called Xy in the query below)

In [ ]:

And then run queries

In [ ]:
sql_result1_df = sqlContext.sql('select count(*) from Xy')

In [ ]:
# which again is lazily executed

In [ ]:

We can now run large-scale regression on the DataFrame

In [ ]:
from pyspark.ml.regression import LinearRegression

In [ ]:
lr_spark = LinearRegression(featuresCol='features', labelCol="y")

In [ ]:

In [ ]:
lr_results = lr_spark.fit(Xy_df)

In [ ]:

In [ ]: